Abstract: Complex networks are at the core of an intense 
research activity. However, in most cases, intricate and costly 
measurement procedures are needed to explore their structure. 
In some cases, these measurements rely on link queries: given two 
nodes, it is possible to test the existence of a link between them. 
These tests may be costly, and thus minimizing their number 
while maximizing the number of discovered links is a key issue. 
This paper studies this problem: we observe that properties 
classically observed on real-world complex networks give hints 
for their efficient measurement; we derive simple principles and 
several measurement strategies based on this, and experimentally 
evaluate their efficiency on real-world cases. In order to do so, 
we introduce methods to evaluate the efficiency of strategies. We 
also explore the bias that different measurement strategies may 
induce. 

I. Preliminaries 

Complex networks, modeled as large graphs, are everywhere 
in science, society, and everyday life. However, it must be 
clear that most real-world complex networks are not directly 
available: collecting information on their structure generally 
relies on intricate and expensive measurement procedures. 
Conducting such a measurement often is a challenge in itself, 
and is an important part of the work needed to study a complex 
network. 

In general, complex network measurements consist in a 
combination of a few simple measurement primitives. In 
several cases, this primitive consists in testing the existence of 
a link, which we call a link query: given two nodes u and v, 
a measurement operation makes it possible to decide whether 
there is a link between them or not. This simple test may be 
expensive (regarding the needed resources or time, or the load 
it induces on the network, for instance) and so conducting 
measurements with as few calls to the measurement primitive 
as possible is a key issue. 

For instance, in online social networks like Facebook or 
Flickr^1, privacy concerns and reduction of server load often 
lead to limitations in the queries that one is allowed to perform 
to explore networks between users. Link queries are however 
allowed in most cases. Likewise, measurements of real-world 
social networks often rely on interviews, in which link queries 
play a central role [1]. In biological networks like protein 
interactions or gene regulatory networks, link queries also play 
a key role [2], [3]. 

^1 http://www.facebook.com/ and http://www.flickr.com/



In all these contexts, and others, link queries are very 
expensive: they impose a significant load on the servers running 
online social network software and their number is generally 
bounded; they have a significant cost for interviewers and 
participants in sociological studies; or they require costly 
biological experiments, depending on the case. 

In this paper, we formalise this problem as follows: given 
a graph G = (V, E), we want to define strategies (ordered 
lists of link queries) which lead to the discovery of as many 
links of the network as possible. In other words, we want to 
minimize the number of link queries while maximizing the 
number of observed links, i.e. the number of positive answers 
to these tests^2. 

In order to do so, we will rely on simple intuitions derived 
from statistical properties observed on most real-world 
complex networks, which we discuss in Section II. We then 
propose several measurement strategies in Section III based on 
these principles. We also need a way to compare and evaluate 
measurement strategies, see Section IV. We finally use this to 
experimentally evaluate proposed strategies in Section V. 

Before entering the core of this paper, we give the needed 
formalism and notations, and discuss related work. 

A. Formalism and notations 

Throughout the paper, we will consider an undirected^3 graph 
G = (V, E), with n = |V| nodes and m = |E| links. 
We suppose that all the nodes are known, and focus on link 
discovery only. In other words, we know V but know nothing 
about E (although we will make some statistical assumptions 
in accordance with classical empirical observations in the field, 
see Section II). 

We will denote by N(v) the set of neighbors of v ∈ V: 
N(v) = {u ∈ V, (u, v) ∈ E} and by d(v) its degree: d(v) = 
|N(v)|. 

^2 Notice that, whereas we suppose that link queries are very expensive, the 
computational cost of each strategy is not our concern here; we consider it as 
negligible compared to measurement costs, which fits most real-world cases. 

^3 This means that we make no difference between (u, v) and (v, u), for 
any u and v. 

A measurement consists in a series of link queries, i.e. tests 
of the existence of link (u, v) for two nodes u and v in V. At a 
given stage in such a measurement, one has already discovered 
a set of links, which we will denote by E' ⊆ E. The set of 
extremities of links in E' will be denoted by V' ⊆ V. Notice 
that, although we know V, in general V' ≠ V. We will also 
denote by n' the number of nodes in V' and m' the number 
of discovered links so far: n' = |V'| and m' = |E'|. We 
also define N'(v) = N(v) ∩ V' and d'(v) = |N'(v)| for all 
v ∈ V. Notice that V', E', n', m', N' and d' all vary during 
a measurement; however, the context will make it clear which 
value we consider. 
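
As an illustration, the book-keeping behind this notation can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the class name `Measurement`, its methods, and the `has_link` callback (standing for the costly link query) are hypothetical, and N'(v) is taken here to be the set of neighbors of v reached through already discovered links.

```python
from collections import defaultdict


class Measurement:
    """State of a link-query measurement on a known node set V (illustrative)."""

    def __init__(self, nodes, has_link):
        self.nodes = list(nodes)            # V, known in advance
        self.has_link = has_link            # the costly link-query primitive
        self.tested = set()                 # pairs already queried
        self.found = set()                  # E', the discovered links
        self.neighbors = defaultdict(set)   # N'(v), discovered neighbors of v
        self.history = []                   # m' recorded after each query

    @staticmethod
    def pair(u, v):
        # Undirected graph: (u, v) and (v, u) are the same pair.
        return (u, v) if u <= v else (v, u)

    def query(self, u, v):
        """Test pair (u, v) at most once; return True if the link exists."""
        p = self.pair(u, v)
        if u == v or p in self.tested:
            return p in self.found          # no new query is spent
        self.tested.add(p)
        if self.has_link(u, v):
            self.found.add(p)
            self.neighbors[u].add(v)
            self.neighbors[v].add(u)
        self.history.append(len(self.found))
        return p in self.found

    def degree(self, v):
        """d'(v): number of discovered links of v."""
        return len(self.neighbors.get(v, ()))

    def seen_nodes(self):
        """V': extremities of the discovered links."""
        return {v for v, ns in self.neighbors.items() if ns}
```

The `history` list records m' after each query, which is the raw material needed to compare strategies for a given query budget.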

B. Related work 

This work belongs to the field of complex network metrology, 
which mostly focused on the specific case of the Internet 
topology until now, see for instance [4]-[10]. This area of 
research aims mainly at evaluating the relevance of collected 
complex network samples and properties observed on them, 
and correcting these observations. Viewing the measurement 
as the combination of many instances of a simple primitive 
(link queries, here) which we want to optimize is new, and is 
an important contribution of this paper. 

Another related problem is the one of link prediction: given 
a network in which new links may appear, one wants to predict 
which new links will appear in the future based on currently 
existing ones [11], [12]. In this context, authors use properties 
of the known network to infer probable future links, which is 
similar to what we do below in the measurement context. The 
main difference lies in the fact that very little of the network 
topology is known in our case. 

II. Underlying principles 

Our goal is to design measurement strategies based on link 
queries (test of the existence of a link between two given 
nodes) which will minimize the number of such queries and 
maximize the number of discovered links (i.e. the number of 
positive answers to these tests). In order to do so, we will rely 
on some simple statistical properties which are observed on 
most real-world complex networks [13]. 

A. Properties of complex networks 

First, we will suppose that G is sparse: its density δ = 
2m / (n(n-1)) is very small. In other words, the probability that a 
link exists between two randomly chosen nodes is very small, 
i.e. a random link query will fail with high probability. 

The second key property is the fact that most complex 
networks have a very heterogeneous degree distribution (often 
close to a power law). Since the degree of a node is the number 
of links attached to it, this means that there is a high variability 
in the number of links of each node (many nodes have 
very few links, but some have more, and even many more). 

Finally, another key property is the local density: although 
randomly chosen nodes have a very low probability to be 
linked, two nodes which have a neighbor in common are linked 
with a much higher probability. This is generally captured 
by the clustering coefficient or the transitivity ratio [13]-[15], 
defined by: 

    cc(G) = \frac{1}{n} \sum_{v \in V} \frac{\Delta(v)}{\vee(v)}
    \qquad \text{and} \qquad
    tr(G) = \frac{3 \cdot \Delta(G)}{\vee(G)}

where, for each v ∈ V, Δ(v) denotes the number of triangles 
(sets of three nodes with three links) to which v belongs; 
∨(v) = d(v)(d(v)-1)/2 denotes the number of pairs of neighbors 
of v; Δ(G) = (1/3) Σ_v Δ(v); and ∨(G) = Σ_v ∨(v). 

A classical observation in complex network studies is that 
both these quantities are high, at least compared to the density. 
In other words, if one chooses a random pair of links with an 
extremity in common (transitivity ratio) or a random node and 
two of its neighbors (clustering coefficient) then the probability 
that the third possible link exists is high. 
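
To make these three quantities concrete, the following Python sketch computes the density, the clustering coefficient and the transitivity ratio of a small graph. It is only an illustration: the function name `local_density_stats` and the adjacency-dictionary input are hypothetical, and it assumes that the clustering coefficient is averaged over all n nodes, with nodes having fewer than two neighbors contributing 0.

```python
from itertools import combinations


def local_density_stats(adj):
    """Return (density, clustering coefficient, transitivity ratio).

    adj: dict mapping each node to the set of its neighbors
         (undirected graph, no self-loops).
    """
    n = len(adj)
    m = sum(len(ns) for ns in adj.values()) // 2
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    tri = {}  # number of triangles to which each node belongs
    vee = {}  # number of pairs of neighbors of each node
    for v, ns in adj.items():
        tri[v] = sum(1 for a, b in combinations(ns, 2) if b in adj[a])
        vee[v] = len(ns) * (len(ns) - 1) // 2

    # Assumption: nodes with fewer than two neighbors contribute 0 to cc.
    cc = sum(tri[v] / vee[v] for v in adj if vee[v] > 0) / n if n else 0.0

    # sum(tri.values()) counts every triangle three times (once per corner),
    # so it equals 3 times the number of triangles of the graph.
    total_vee = sum(vee.values())
    tr = sum(tri.values()) / total_vee if total_vee else 0.0
    return density, cc, tr
```

On a triangle with one pendant node attached, for instance, the node carrying the pendant has three pairs of neighbors but belongs to a single triangle, which lowers both coefficients below 1.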

B. Consequences on measurements 

The properties above, observed on most real-world complex 
networks, have a strong impact on measurements and will play 
a key role here. 

First, the low density of complex networks implies that 
randomly choosing two nodes and testing the presence of a 
link between them is very inefficient. Notice however that, 
when only link queries are possible, one has no choice but to 
begin with a series of such random measurements. Still, 
exploring a large complex network with 
such a strategy alone is not reasonable. 

Instead, the existence of nodes with degree much larger than 
the average may be useful for efficient measurement. Suppose 
that we test a random pair (u, v). The probability that it is 
positive (i.e. the link (u, v) exists) is proportional to the degree 
of u (resp. v). Therefore, if it exists then one may guess that 
u (resp. v) has a high degree, and so testing all pairs (u, w) 
(resp. (v, w)) for any w will probably lead to the discovery of 
many links. Notice that u and v play a symmetric role in this 
reasoning. We will call this observation the degree principle. 

Likewise, the high local density may be used for efficient 
measurement: when we know that two nodes u and v have a 
neighbor w in common then testing pair (u, v) certainly makes 
sense as this link exists with high probability. We call this the 
triangle principle. 

We may now turn to the definition of measurement strategies 
based on these principles. 

III. Measurement strategies 

First notice that when one starts a measurement in our 
framework, no link is known and we have no way to distinguish 
between vertices. Therefore, there is no choice but to 
test random pairs of nodes. We call this null strategy random_k. 

Strategy 1: random_k, with k an integer. 

    while m' < k do 
        Test a random untested pair 
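
A minimal Python sketch of this null strategy follows. It is an illustration only: the function name `random_k`, the `has_link` callback (standing for the link query) and the `max_queries` budget are assumptions that do not appear in the strategy itself, and materialising every pair is only practical on small graphs.

```python
import random
from itertools import combinations


def random_k(nodes, has_link, k, max_queries, seed=0):
    """Query random untested pairs until k links are found or the budget is spent.

    Returns (found, queries): the discovered links and the number of queries used.
    """
    rng = random.Random(seed)
    pairs = list(combinations(sorted(nodes), 2))
    # Walking through a random permutation of all pairs is equivalent to
    # repeatedly drawing a random untested pair.
    rng.shuffle(pairs)
    found, queries = set(), 0
    for u, v in pairs:
        if len(found) >= k or queries >= max_queries:
            break
        queries += 1
        if has_link(u, v):
            found.add((u, v))
    return found, queries
```

The explicit budget is there because, on a sparse graph, reaching k discovered links by random queries alone may take a very large number of tests.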



As soon as some links are discovered, though, one may 
try to design more efficient strategies. The triangle principle 
indicates that, when a V pattern (two links with an extremity 
in common) is discovered, one may test 
the missing link in the triangle. This leads to the following 
strategy. 



Strategy 2: V-random_k, with k an integer. 

    while m' < k do 
        Test a random untested pair (u, v) 
        if (u, v) exists then 
            Test all untested pairs (v, w), for any w in N'(u) 
            Test all untested pairs (u, w), for any w in N'(v) 
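
The triangle-closing step can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in the experiments: the names `v_random_k` and `has_link` are hypothetical, nodes are assumed to be orderable, and all pairs are materialised, which is only practical on small graphs.

```python
import random
from collections import defaultdict
from itertools import combinations


def v_random_k(nodes, has_link, k, seed=0):
    """Random queries, each positive answer being followed by triangle closing.

    Returns (found, queries).
    """
    rng = random.Random(seed)
    pairs = list(combinations(sorted(nodes), 2))
    rng.shuffle(pairs)
    tested, found = set(), set()
    nbrs = defaultdict(set)   # N'(v): neighbors of v discovered so far
    queries = 0

    def test(a, b):
        """Query pair (a, b) if untested; return True on a newly found link."""
        nonlocal queries
        p = (a, b) if a < b else (b, a)
        if a == b or p in tested:
            return False
        tested.add(p)
        queries += 1
        if has_link(*p):
            found.add(p)
            nbrs[a].add(b)
            nbrs[b].add(a)
            return True
        return False

    for u, v in pairs:
        if len(found) >= k:
            break
        if test(u, v):
            # (u, v) exists: try to close every V pattern it creates.
            for w in list(nbrs[u]):
                test(v, w)
            for w in list(nbrs[v]):
                test(u, w)
    return found, queries
```

As in the strategy above, the bound k is only checked between random tests, so the closing step may push the number of discovered links slightly past k.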



Applying directly the degree principle would lead to a 
strategy in which we test the pairs (u, v) for all v as soon as a 
random test led to the discovery of a link of u. However, the 
degree principle becomes stronger if one waits until several 
links of a node are found. We therefore propose a strategy 
in which a series of tests (performed according to another 
strategy) is followed by a use of the degree principle on nodes 
for which we discovered many links. 



Strategy 3: (V-)Complete Simple, cs_k (resp. V-cs_k), with k an integer. 

    Apply random_k (resp. V-random_k) 
    foreach u ∈ V' in decreasing order of d'(u) do 
        Test all untested pairs (u, v), for any v ∈ V 



This strategy may be improved by using the links it discov- 
ers for choosing the next link queries to perform. This leads 
to the following strategy. 



Strategy 4: (V-)Complete, c_k (resp. V-c_k), with k an integer. 

    Apply random_k (resp. V-random_k) 
    Let X = V' 
    while X is nonempty do 
        Let u in X with d'(u) maximal 
        Remove u from X 
        Test all untested pairs (u, v), for any v ∈ V 
        if (u, v) exists and is the first link of v discovered then 
            Add v to X 
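
The second phase of this strategy can be sketched in Python as below. It is a hedged illustration rather than the authors' code: the function name `complete`, the `has_link` callback and the `max_queries` budget are assumptions, and the initial phase is represented only by the sets `found` and `tested` of sorted pairs it produced.

```python
from collections import defaultdict


def complete(nodes, has_link, found, tested, max_queries):
    """Second phase of the complete strategy, run after some initial phase.

    nodes: a list (it is traversed several times).
    found, tested: sets of sorted pairs left by the initial phase (not mutated).
    Returns (found, queries), where queries counts the tests of this phase only.
    """
    found = set(found)
    tested = set(tested) | found
    nbrs = defaultdict(set)                 # N'(v)
    for u, v in found:
        nbrs[u].add(v)
        nbrs[v].add(u)
    X = set(nbrs)                           # X = V'
    queries = 0
    while X and queries < max_queries:
        # Node of X with d'(u) maximal; ties are broken arbitrarily.
        u = max(X, key=lambda x: len(nbrs[x]))
        X.remove(u)
        for v in nodes:
            if queries >= max_queries:
                break
            p = (u, v) if u < v else (v, u)
            if u == v or p in tested:
                continue
            tested.add(p)
            queries += 1
            if has_link(*p):
                if not nbrs[v]:             # first discovered link of v
                    X.add(v)
                found.add(p)
                nbrs[u].add(v)
                nbrs[v].add(u)
    return found, queries
```

Because newly reached nodes are added to X, the procedure keeps spreading from the nodes found in the initial phase until X is empty or the budget runs out.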



One may try to use an even stronger version of the degree 
principle by noticing that the probability of a link between two 
nodes is even larger if both have a high degree. Therefore, link 
queries between nodes for which we already discovered many 
links have an even higher probability of positive outcome. This 
leads to the following strategy. 



Strategy 5: (V-)Test-Between-Found, tbf_k (resp. V-tbf_k), with k an integer. 

    Apply random_k (resp. V-random_k) 
    foreach (u, v) ∈ V' × V' in decreasing order of d'(u) + d'(v) do 
        Test (u, v) if it was untested 
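
The ordering used in this strategy can be illustrated with the Python sketch below. It rests on assumptions: the names `between_found`, `has_link` and `max_queries` are hypothetical, nodes are assumed to be orderable, and the degrees d' are computed once from the initial phase and not updated while the pairs are tested, which is one possible reading of the strategy.

```python
from itertools import combinations


def between_found(has_link, found, tested, max_queries):
    """Test pairs of already seen nodes, highest d'(u) + d'(v) first.

    found, tested: sets of sorted pairs left by the initial phase.
    Returns (new_links, queries).
    """
    deg = {}                                  # d'(v) after the initial phase
    for u, v in found:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    done = set(tested) | set(found)
    candidates = [p for p in combinations(sorted(deg), 2) if p not in done]
    # Stable sort: pairs with equal degree sums keep their original order.
    candidates.sort(key=lambda p: deg[p[0]] + deg[p[1]], reverse=True)
    new_links, queries = set(), 0
    for u, v in candidates[:max_queries]:
        queries += 1
        if has_link(u, v):
            new_links.add((u, v))
    return new_links, queries
```

Since only pairs inside V' are considered, the number of queries this phase can spend is bounded by n'(n'-1)/2, whatever the budget.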



Finally, one may try to combine the strategies above in 
order to improve their efficiency. Indeed, some of them use 
complementary principles which both help in discovering more 
links with fewer link queries. One may therefore expect even 
better results with combinations of them. We will therefore 
consider the following strategy. 



Strategy 6: (V-)TBF-Complete, tbfc_k (resp. V-tbfc_k), with k an integer. 

    Apply tbf_k (resp. V-tbf_k) 
    Apply c_0 



It must be clear that many variants and improvements of the 
strategies above are possible. Probably, completely different 
strategies may also be defined. Our goal here however is to 
evaluate the relevance of the degree principle and triangle 
principle in the design of measurement strategies. We therefore 
focus on these relatively simple strategies, which we consider 
as a natural first set of strategies derived from these basic 
principles. 

IV. Evaluation methodology 

For any measurement strategy S, let us define m'_S(q) as the 
expected number of links discovered with q link queries with 
strategy S^4. It must be clear that our goal, for a given q, is 
to design a strategy S that maximises m'_S(q). Conversely, one 
may want to discover a given number x of links and ask for 
the strategy S that will minimize the q such that m'_S(q) = x. 

However, given two numbers of queries q and r it is 
possible that a given strategy S discovers more links with 
q tests than another strategy T, while T discovers more with 
r tests (we will observe such a situation in Section V-B). As 
a consequence, it makes no sense to say that S is better than 
T, nor the converse; this depends on the allowed number of 
link queries. 

Going further, one may notice that if S and T discover the 
same number of links after a given number q of tests, but if S 
discovers more links than T for any number r < q of tests, then 
it seems natural to consider that S surpasses T (it discovers 
the same number of links, but faster). 

A simple way to formalise these intuitions is to define the 
efficiency of a strategy S for a given number of queries q 
as the (discrete) integral of the function m'_S from 0 to q: 

    \mathcal{E}_q(S) = \sum_{i=0}^{q} m'_S(i)

^4 Notice that, in practice, it is in general impossible to reach a situation 
where we test all pairs of nodes: q = n(n-1)/2, or conversely where we 
discovered all existing links: m'_S(q) = m. 

Notice that the obtained value will depend on the considered 
graph, and on q. It seems difficult to avoid this, as the 
efficiency of strategies does indeed depend on the graph under 
consideration, and on the number of allowed link queries. We will 
therefore always compare strategies run on the same graph and 
with the same number of link queries here. 

Another weakness of this definition is that it may give any 
positive value for the efficiency of a strategy, making it hard 
to evaluate how far from the worst or best solution we are. 
In order to avoid this we introduce the normalised efficiency: 

    \bar{\mathcal{E}}_q(S) = \frac{\mathcal{E}_q(S) - \mathcal{E}_q(min)}{\mathcal{E}_q(max) - \mathcal{E}_q(min)}

where min and max stand for 
the worst and best strategies, i.e. the ones with minimal and 
maximal efficiencies. 

Notice that strategies min and max are easy to determine: 
min consists in testing pairs of nodes with no links between 
them as long as possible, thus n(n-1)/2 - m times, and then 
performing the positive tests; conversely max consists in 
performing first the m positive tests. As a consequence, we 
can compute easily $\mathcal{E}_q(min)$ and $\mathcal{E}_q(max)$ for any q, and thus 
obtain the normalized efficiency of any strategy. 

The notion of normalized efficiency however remains insuf- 
ficient. Indeed, as we consider sparse graphs, there are only 
very few positive link queries, and thus one may expect to 
be much closer to the min strategy than to the max. As a 
consequence, the efficiency of any strategy will be very low. 

A solution to this problem consists in comparing strategies 
to the random one, denoted by ran, which consists in performing 
link queries on random untested pairs of nodes. The 
expected efficiency of this strategy is easy to compute, as the 
probability of success of a link query is exactly the density δ; 
we obtain: 

    \mathcal{E}_q(ran) = \sum_{i=1}^{q} i \, \delta = \frac{q(q+1)}{2} \, \delta
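
The quantities introduced so far in this section can be computed with a few lines of Python. The sketch below is illustrative only: all function names are hypothetical, and `history` is assumed to be a list in which entry i-1 holds the number of links discovered after i queries (with no link known before the first query), containing at least q entries.

```python
def efficiency(history, q):
    """Discrete integral of the number of discovered links from 0 to q."""
    return sum(history[:q])


def eff_max(q, m):
    """Efficiency of the best strategy: the m positive tests come first."""
    return sum(min(i, m) for i in range(1, q + 1))


def eff_min(q, n, m):
    """Efficiency of the worst strategy: all negative tests come first."""
    negatives = n * (n - 1) // 2 - m
    return sum(max(0, i - negatives) for i in range(1, q + 1))


def eff_random(q, density):
    """Expected efficiency of random testing: q(q+1)/2 times the density."""
    return q * (q + 1) / 2 * density


def normalised(history, q, n, m):
    """Efficiency rescaled between the worst (0) and best (1) strategies."""
    lo, hi = eff_min(q, n, m), eff_max(q, m)
    return (efficiency(history, q) - lo) / (hi - lo)
```

For example, on a graph with n = 4 nodes and m = 2 links, a run whose first three queries give 0, 1 and 1 discovered links has efficiency 2 after q = 3 queries, against 5 for the best strategy and 0 for the worst, hence a normalised efficiency of 0.4.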

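Finally, we introduce the relative efficiency, which indicates 
how a given strategy S performs compared to the random 
one (and the minimal and maximal ones) after q link queries: 

    \mathcal{R}_q(S) = \frac{\mathcal{E}_q(S) - \mathcal{E}_q(min)}{\mathcal{E}_q(ran) - \mathcal{E}_q(min)}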

Notice that the relative efficiency does not give a value 
between 0 and 1 and therefore does not have the advantage 
of being relatively independent from the context. However, 
we cannot normalize it as we would lose the benefit of the 
comparison to the random strategy. We will therefore use both 
the normalized efficiency and the relative efficiency to discuss 
efficiency of strategies below, and keep in mind that in any 
case the efficiency of a strategy depends on the graph under 
consideration and on the number of link queries allowed. Only the 
full m'_S function can describe the efficiency of strategy S 
entirely, on a given graph. 

V. Experimental evaluation 

In this section, we present experiments aimed at illustrating 
the differences between the proposed measurement strategies, 
and how they may be evaluated. We first present the dataset 
we used, which is a typical real-world case. We then examine 
a typical situation and discuss the observations. We deepen 
this by observing the impact of the initial random period of 
measurement; and finally we discuss the bias that measurement 
strategies may induce on observed properties. 

A. Dataset 

We use here data on an online social network which we 
consider as a typical example of complex networks studied in 
the literature. This social network comes from the Flickr site, 
which provides facilities for publishing online photos, sharing 
them with others, discussing them, etc. Users may also subscribe 
to various interest groups and have lists of other users known 
as their contacts. 

Here we used a complete measurement of Flickr conducted 
in August 2006 [16]. We considered the largest of the 72 875 
groups observed then^5, which contained 31 523 members. 

We then defined three different networks among these 
31 523 users: 

• contact: two users a and b are linked if a is a contact of 
b or b is a contact of a; 

• comment: two users a and b are linked if a posted a 
comment on a photo from b or b posted a comment on a 
photo from a; 

• symmetric-comment: two users a and b are linked if both 
a posted a comment on a photo from b and b posted a 
comment on a photo from a. 
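
The three definitions above differ only in how a directed relation (a is a contact of b, a commented on a photo from b) is turned into undirected links. A minimal Python sketch of this step is given below; the function name `symmetrise` and its `require_both` flag are hypothetical and only serve the illustration.

```python
def symmetrise(directed_pairs, require_both=False):
    """Turn directed relations (a, b) into a set of undirected links.

    require_both=False: a and b are linked if (a, b) or (b, a) is present
                        (contact and comment networks).
    require_both=True:  a and b are linked only if both directions are present
                        (symmetric-comment network).
    """
    directed = set(directed_pairs)
    links = set()
    for a, b in directed:
        if a == b:
            continue                      # ignore self-relations
        if not require_both or (b, a) in directed:
            links.add((a, b) if a < b else (b, a))
    return links
```

By construction, the links obtained with `require_both=True` are always a subset of those obtained with `require_both=False`.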

One may also define a symmetric-contact graph in which 
two users a and b are linked if both a is a contact of b 
and b is a contact of a. In order to save space, we will not 
consider it here. Likewise, we do not detail the features of 
these networks; the key point here is that they are sparse, 
have heterogeneous degree distributions and high clustering 
coefficient and transitivity ratio. In this regard, they are similar 
to most real-world complex networks, and so the principles 
discussed in Section II apply. 

B. A typical example 

Let us first try all our strategies with the same parameter 
k = 1 000 and on the contact graph. We represent in Figure 1 
the number m'_S(q) of links discovered by each strategy S as 
a function of the number q of link queries performed, for q 
between 0 and Q = 4.10^. The obtained plot is representative 
of what is obtained on other graphs. 

This plot shows clearly that measurement strategies perform 
very differently, and that trying to optimize them is relevant. 
This is a first important result in itself. Moreover, both the 
degree principle and the triangle principle are useful in doing so: 
strategies based on each of them perform significantly better 
than the random strategy. However, the degree principle seems 
to be much stronger: while the improvement of V-random 
remains quite low, the improvement obtained by the complete 
strategy is huge. This is probably due to the fact that, although 
the clustering coefficient and transitivity ratio are much larger 
than the density, they remain quite small; instead, the largest 
degrees in the graph are relatively close to its number of 
nodes. 

^5 FlickrCentral, http://flickr.com/groups/central/
Fig. 1. Number of links (vertical axis) discovered by each strategy as a 
function of the number of link queries performed (horizontal axis) in a typical 
case (contact network, k = 1 000). The tbf, tbf-complete and V-tbf-complete 
strategies are indicated as a unique curve (named TBFs) in the plot as the 
three curves overlap each other. 



The best final results (the largest number of links discovered 
at the end of the measurement) are obtained with mixed 
strategies, namely tbf-complete and V-tbf-complete, which 
succeed in discovering between 22 and 23% of existing 
links by performing only 1% of possible link queries. They 
slightly outperform complete, which was expected as they are 
more subtle (though a stronger improvement may have been 
expected). 

Notice that, although these strategies finally outperform 
complete, they discover links later than this strategy. In this 
sense, they may therefore be considered as less efficient, which 
is captured by our notion of efficiency, see Table I. 





Strategy       m'         % tested   % found   norm. eff.   rel. eff.
random         9 609      1.04       1.03      0.006        0.99
V-random       21 030     1.04       2.25      0.010        1.64
c_1000         209 485    1.04       22.4      0.142        24.2
tbf_1000       68 874     0.46       7.36      0.048        15.6
tbfc_1000      218 448    1.04       23.4      0.131        22.3
V-tbfc_1000    214 175    1.04       22.9      0.134        22.7

TABLE I 

Efficiency of each strategy after 4.10^ link queries on the contact network: 
the number m' of discovered links; the percentage of tested pairs of nodes; 
the percentage of existing links found; and the efficiency coefficients, 
namely the normalised efficiency (norm. eff.) and the relative efficiency (rel. eff.). 



C. Impact of the initial phase 

In order to test the impact of the initial phase on the 
efficiency of the strategies, we conducted a similar experiment 
in which we increased the parameter k to 1 500. We present 
the results in Figure 2. The change of the k value implies in 
particular that the random phase will last longer than in the 
previous runs as it looks for a larger set of discovered links. 
This induces a delay before the beginning of the second phase 
of the strategies which should in turn decrease their efficiency 
as they have fewer queries left to test the existence of the links. 

Surprisingly though, the efficiency of the strategies does 
not seem to be affected. The number of discovered links after 
the same number of queries is for instance comparable in the 
contact network case (around 21% of the existing links). This 
can be explained by the fact that while searching for the 1 500 
links, the random phase has improved the partial knowledge 
of the network topology. It is very likely then that the highly 
connected nodes have emerged more significantly during this 
phase. Thus the ordering used by the elaborated strategies, 
based on the degree principle, is in turn more pertinent. 

The plots based on the comment and symmetric-comment 
networks also show that the behaviour of the strategies can be 
very similar in some cases. This suggests investigating other 
criteria to tell their efficiency apart. 

D. Measurement bias 

Until now, we focused on our ability to discover many links 
with as few link queries as possible. However, different strategies 
discover different links, which may have consequences on 
the properties of the obtained samples: they may be biased by 
the measurement strategy, and biased differently depending on 
the strategy we use. This can be observed visually in Figure 3 
for instance, and confirmed by the statistics given in Table II. 





Fig. 3. Drawings of samples obtained with the complete (left) and 
tbf-complete (right) strategies after 20 000 link queries. The position of 
the nodes is the same in the two drawings (it is obtained by a classical graph 
drawing algorithm run on the actual network), which makes it possible to 
observe visually that the links discovered by each strategy are not the same. 
Strategy         n'          δ       avg. degree   max. degree   cc      tr
actual network   [missing]   0.002   35.5          1708          0.083   0.124
random           6307        0.000   2.1           38            0.001   0.001
V-random         6248        0.001   3.1           123           0.133   0.120
c_1500           9840        0.001   13.0          1708          0.061   0.422
tbf_1500         2289        0.024   54.5          663           0.175   0.208
tbfc_1500        7717        0.003   20.0          1708          0.085   0.371
V-tbfc_1500      8789        0.002   17.7          1708          0.072   0.388

TABLE II 

Main statistical properties (number n' of nodes involved in the links finally 
discovered, density δ, average degree, maximal degree, clustering coefficient cc 
and transitivity ratio tr) 
of the samples obtained by each measurement strategy with k = 1500 
applied on the symmetric-comment network. We also display the properties 
of the actual network (first row), for comparison. 



These experiments clearly show that the observed properties 
are biased by the measurement (they are not the same as 
the ones of the actual network), and moreover that different 
strategies lead to different biases. 

One can notice for instance that the complete and the 
test-between-found strategies induce a very different bias on 
the properties. It is likely to be due to the fact that the number 
of involved nodes (n') is very low for the second strategy 
but all the possible links between them have been tested. 
Fig. 2. Number of links (vertical axis) discovered by each strategy as a function of the number of link queries performed (horizontal axis) for each of our 
three graphs (from left to right: contact, comment and symmetric-comment), with k = 1 500. The tbf, tbf-complete and V-tbf-complete strategies are indicated 
as a unique curve (named TBFs) in the two last plots as the three curves overlap each other. 



This means in particular that all the possible triangles have 
been discovered which leads naturally to an over-evaluation 
of the clustering coefficient. The strict opposite happens in 
the complete case since many nodes are discovered but the 
links between them are not directly tested. 

In some specific cases though, the values are correctly 
evaluated by the strategies. Strategy V-random, for instance, 
gives a correct value of the transitivity ratio. This is well 
explained by the strategy itself that tests the existence of the 
third link of a triangle as soon as two nodes appear to have a 
common neighbor. 

It is also worth noticing that the mixed strategies have 
a better evaluation of the clustering coefficient than other 
strategies. This can be explained by the fact that, as the name 
suggests, they mix the effects of the different strategies. In 
particular, the over-evaluation of this property given by the 
test-between-found phase seems to be compensated by the 
under-evaluation of the complete phase. 

These observations suggest putting the quantitative 
assessments of the runs in perspective and trying to integrate 
a qualitative point of view in the evaluation of the efficiency 
of the strategies. 

VI. Conclusion and perspectives 
In this paper, we studied the problem of measuring large 
complex networks when the measurement operation consists 
in testing the existence of a link between two nodes. We 
proposed different strategies for ordering the link queries in 
order to minimize their number while maximizing the number 
of discovered links. Those strategies rely on the expected 
statistical properties of the network in order to predict the 
existence of the links and we tested this approach on several 
real-world networks based on the Flickr database. 

The empirical results confirmed that the principles underlying 
the development of the strategies are relevant in this 
measurement context. The experiments showed that the elaborated 
strategies made a huge improvement compared to the random 
approach. But they also raised the question of accounting 
for the bias they induce on the extracted samples. It turned 
out that the different strategies gave different evaluations of 
the statistical properties of the original networks. This result 
suggests combining them in order to compensate those 
unwanted effects, which is what we plan to investigate more 
specifically in the future. 



Going further, we also intend to extend the kind of real-world 
networks on which we test the strategies, extend 
the list of statistical properties considered (such 
as the assortativity and the degree-degree correlation) and 
try to adapt the strategies to directed graphs. Another 
interesting perspective would be to address the problem by 
means of formal approaches. 